Data processing task


Generative AI for Research Data Processing: Lessons Learnt From Three Use Cases

Mitra, Modhurita, de Vos, Martine G., Cortinovis, Nicola, Ometto, Dawa

arXiv.org Artificial Intelligence

There has been enormous interest in generative AI since ChatGPT was launched in 2022. However, there are concerns about the accuracy and consistency of the outputs of generative AI. We have carried out an exploratory study on the application of this new technology in research data processing. We identified tasks for which rule-based or traditional machine learning approaches were difficult to apply, and then performed these tasks using generative AI. We demonstrate the feasibility of using the generative AI model Claude 3 Opus in three research projects involving complex data processing tasks: 1) Information extraction: we extract plant species names from historical seedlists (catalogues of seeds) published by botanical gardens. We share the lessons we learnt from these use cases: how to determine whether generative AI is an appropriate tool for a given data processing task and, if so, how to maximise the accuracy and consistency of the results obtained. In this paper, we share our insights on the application of generative AI in research software engineering projects. Generative AI can potentially be used to perform a wide variety of research data processing tasks, such as interpreting documents, extracting information from them, and classifying text into categories. Since the tasks are specified through prompts in natural language, the barrier to entry is low. This tool can therefore be used by domain experts in a wide range of fields, with varying levels of programming skill and depth of knowledge of technical topics such as machine learning.
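As a minimal sketch of the kind of prompt-driven extraction described here, the snippet below calls Claude 3 Opus through Anthropic's Python SDK. The prompt wording and the seedlist excerpt are illustrative assumptions, not the paper's actual materials.

    # Hypothetical sketch: extract species names from a seedlist with Claude 3 Opus.
    # The prompt text and sample page are placeholders, not the paper's prompts.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    seedlist_page = "Index Seminum 1893: Acer campestre L.; Quercus robur L."

    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Extract every plant species name from the seedlist below. "
                       "Return one binomial name per line, nothing else.\n\n"
                       + seedlist_page,
        }],
    )
    print(response.content[0].text)  # e.g. "Acer campestre\nQuercus robur"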


GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines

Wong, Mike, Butler, Ulysses, Farkash, Emma, Tammana, Praveen, Sivaraman, Anirudh, Netravali, Ravi

arXiv.org Artificial Intelligence

The increasing prominence of AI necessitates the deployment of inference platforms for efficient and effective management of AI pipelines and compute resources. As these pipelines grow in complexity, the demand for distributed serving rises and introduces much-dreaded network delays. In this paper, we investigate how the network can instead be a boon to the excessively high resource overheads of AI pipelines. To alleviate these overheads, we discuss how resource-intensive data processing tasks -- a key facet of growing AI pipeline complexity -- are well-matched for the computational characteristics of packet processing pipelines and how they can be offloaded onto SmartNICs. We explore the challenges and opportunities of offloading, and propose a research agenda for integrating network hardware into AI pipelines, unlocking new opportunities for optimization.


From PDFs to Structured Data: Utilizing LLM Analysis in Sports Database Management

Merilehto, Juhani

arXiv.org Artificial Intelligence

This study investigates the effectiveness of Large Language Models (LLMs) in processing semi-structured data from PDF documents into structured formats, specifically examining their application in updating the Finnish Sports Clubs Database. Through action research methodology, we developed and evaluated an AI-assisted approach utilizing OpenAI's GPT-4 and Anthropic's Claude 3 Opus models to process data from 72 sports federation membership reports. The system achieved a 90% success rate in automated processing, successfully handling 65 of 72 files without errors and converting over 7,900 rows of data. While the initial development time was comparable to traditional manual processing (three months), the implemented system shows potential for reducing future processing time by approximately 90%. Key challenges included handling multilingual content, processing multi-page datasets, and managing extraneous information. The findings suggest that while LLMs demonstrate significant potential for automating semi-structured data processing tasks, optimal results are achieved through a hybrid approach combining AI automation with selective human oversight. This research contributes to the growing body of literature on practical LLM applications in organizational data management and provides insights into the transformation of traditional data processing workflows.
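The study does not publish its pipeline code; the following is a hypothetical sketch of the core step it describes, pulling raw text out of a PDF with pypdf and asking GPT-4 to emit structured rows. The prompt, column schema, and file name are invented for illustration.

    # Hypothetical sketch of the PDF-to-structured-data step; the prompt and
    # CSV schema below are illustrative assumptions, not the study's pipeline.
    from openai import OpenAI
    from pypdf import PdfReader

    def report_to_csv(pdf_path: str) -> str:
        # Concatenate the text of every page in the membership report.
        raw_text = "\n".join(page.extract_text() or ""
                             for page in PdfReader(pdf_path).pages)
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Convert this sports club membership report into CSV "
                           "with columns club_name,municipality,member_count. "
                           "Output CSV only.\n\n" + raw_text,
            }],
        )
        return response.choices[0].message.content

    print(report_to_csv("membership_report.pdf"))  # hypothetical file name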


Automated data processing and feature engineering for deep learning and big data applications: a survey

Mumuni, Alhassan, Mumuni, Fuseini

arXiv.org Artificial Intelligence

The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in the sphere of supervised deep learning. It has also simplified the design of machine learning systems, as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases, data has to be manually collected, preprocessed, and further extended through data augmentation before it can be effective for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming it into useful features for big data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines, including automated data preprocessing--e.g., data cleaning, labeling, missing data imputation, and categorical data encoding--as well as data augmentation (including synthetic data generation using generative AI methods) and feature engineering--specifically, automated feature extraction, feature construction, and feature selection. In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.
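As one concrete instance of the preprocessing steps the survey covers, here is a small scikit-learn pipeline that automates missing-data imputation and categorical encoding; the column names and values are made up for illustration.

    # One concrete instance of automated preprocessing: imputation and
    # categorical encoding wired into a single scikit-learn pipeline.
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.DataFrame({
        "age": [34, np.nan, 52],              # numeric column, one value missing
        "city": ["Accra", "Tamale", np.nan],  # categorical column, one value missing
    })

    preprocess = ColumnTransformer([
        ("numeric", Pipeline([
            ("impute", SimpleImputer(strategy="median")),  # fill NaN with the median
            ("scale", StandardScaler()),
        ]), ["age"]),
        ("categorical", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), ["city"]),
    ])

    print(preprocess.fit_transform(df))  # raw frame in, model-ready matrix out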


Parallel computing in Python using Dask

#artificialintelligence

Parallel computing is an architecture in which several processors execute or process an application or computation simultaneously. Parallel computing helps in performing extensive calculations by dividing the workload between more than one processor, all of which work through the calculation at the same time. The primary goal of parallel computing is to increase the available computation power for faster application processing and problem solving. In sequential computing, all the instructions run one after another without overlapping, whereas in parallel computing instructions run in parallel to complete the given task faster. Dask is a free and open-source library used to achieve parallel computing in Python. It works well with popular Python libraries like pandas, NumPy, scikit-learn, etc.
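A minimal example of the pattern: wrap a pandas DataFrame in a Dask DataFrame, build the computation lazily, and let .compute() run it across partitions in parallel. The toy data is, of course, invented.

    # Small sketch of Dask parallelism: a familiar pandas-style groupby,
    # split across partitions and evaluated lazily.
    import dask.dataframe as dd
    import pandas as pd

    pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
    ddf = dd.from_pandas(pdf, npartitions=2)  # split the frame into 2 partitions

    result = ddf.groupby("key")["value"].mean()  # builds a task graph, no work yet
    print(result.compute())  # .compute() runs the graph across workers in parallel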


Getting Down to Basics

Communications of the ACM

Writing the code to make a computer perform a particular job could be a Herculean task back in the 1950s and '60s. "In the early 1950s, people did numerical computation by writing assembly language programs," says Alfred V. Aho, professor emeritus of computer science at Columbia University. "Assembly language is a language very close to the operations of a computer, and it's a deadly way to program." Of course, people can program at higher levels of abstraction, but that requires translating the higher-level language into a more basic set of instructions the machine can understand. Compilers that efficiently perform that translation exist nowadays in large part due to the work of Aho and Jeffrey D. Ullman, professor emeritus of computer science at Stanford University. Their contributions to both the theory and practice of computer languages earned them the 2020 ACM A.M. Turing Award. "Compilers are responsible for generating the software that the world uses today, these trillion ...


The Anatomy of AI: Understanding Data Processing Tasks

#artificialintelligence

But as your data scientists and data engineers quickly realize, building a production AI system is a lot easier said than done, and there are many steps to master before you get that ML magic. At a high level, the anatomy of AI is fairly simple: you start with some data, train a machine learning model on it, and then deploy the model to infer on real-world data. Unfortunately, as the old saying goes, the devil is in the details. And in the case of AI, there are a lot of small details you have to get right before you can claim victory.
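In code, that high-level anatomy fits in a few lines; here is a toy scikit-learn version. The production details the article warns about, such as data cleaning, validation, and monitoring, are exactly what this sketch leaves out.

    # The three-step anatomy in miniature: take some data, train a model on it,
    # then use the model to infer on unseen data (scikit-learn, toy dataset).
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)                       # 1. start with some data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier().fit(X_train, y_train)  # 2. train a model on it
    print(model.predict(X_test[:5]))                        # 3. infer on new data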